Metadata Management
Impact and influence of modern AI in metadata management
Yang, Wenli, Fu, Rui, Amin, Muhammad Bilal, Kang, Byeong
Metadata management plays a critical role in data governance, resource discovery, and decision-making in the data-driven era. While traditional metadata approaches have primarily focused on organization, classification, and resource reuse, the integration of modern artificial intelligence (AI) technologies has significantly transformed these processes. This paper investigates both traditional and AI-driven metadata approaches by examining open-source solutions, commercial tools, and research initiatives. A comparative analysis of traditional and AI-driven metadata management methods is provided, highlighting existing challenges and their impact on next-generation datasets. The paper also presents an innovative AI-assisted metadata management framework designed to address these challenges. This framework leverages advanced AI technologies to automate metadata generation, enhance governance, and improve the accessibility and usability of modern datasets. Finally, the paper outlines future directions for research and development, proposing opportunities to further advance metadata management in the context of AI-driven innovation and complex datasets.
Linking the Dynamic PicoProbe Analytical Electron-Optical Beam Line / Microscope to Supercomputers
Brace, Alexander, Vescovi, Rafael, Chard, Ryan, Saint, Nickolaus D., Ramanathan, Arvind, Zaluzec, Nestor J., Foster, Ian
The Dynamic PicoProbe at Argonne National Laboratory is undergoing upgrades that will enable it to produce up to 100s of GB of data per day. While this data is highly important for both fundamental science and industrial applications, there is currently limited on-site infrastructure to handle these high-volume data streams. We address this problem by providing a software architecture capable of supporting large-scale data transfers to the neighboring supercomputers at the Argonne Leadership Computing Facility. To prepare for future scientific workflows, we implement two instructive use cases for hyperspectral and spatiotemporal datasets, which include: (i) off-site data transfer, (ii) machine learning/artificial intelligence and traditional data analysis approaches, and (iii) automatic metadata extraction and cataloging of experimental results. This infrastructure supports expected workloads and also gives domain scientists the ability to reinterrogate data from past experiments to yield additional scientific value and derive new insights.
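The automatic metadata extraction and cataloging step described in this abstract can be illustrated with a minimal sketch. The file names, fields, and JSON-lines catalog format below are illustrative assumptions, not the PicoProbe pipeline's actual implementation:

```python
import hashlib
import json
import time
from pathlib import Path

def extract_metadata(path: Path) -> dict:
    """Derive minimal catalog metadata for one raw data file."""
    data = path.read_bytes()
    return {
        "file": path.name,
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

def catalog(paths, catalog_path: Path) -> int:
    """Append one JSON record per file to a line-delimited catalog."""
    with catalog_path.open("a") as out:
        for p in paths:
            out.write(json.dumps(extract_metadata(p)) + "\n")
    return len(paths)
```

Keeping a checksum alongside size and ingest time is what makes later reinterrogation of past experiments trustworthy: a record can be matched unambiguously to the bytes it describes.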
MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries
Choudhury, Muntabir Hasan, Salsabil, Lamia, Jayanetti, Himarsha R., Wu, Jian, Ingram, William A., Fox, Edward A.
Metadata quality is crucial for digital objects to be discovered through digital library interfaces. However, due to various reasons, the metadata of digital objects often exhibits incomplete, inconsistent, and incorrect values. We investigate methods to automatically detect, correct, and canonicalize scholarly metadata, using seven key fields of electronic theses and dissertations (ETDs) as a case study. We propose MetaEnhance, a framework that utilizes state-of-the-art artificial intelligence methods to improve the quality of these fields. To evaluate MetaEnhance, we compiled a metadata quality evaluation benchmark containing 500 ETDs, by combining subsets sampled using multiple criteria. We tested MetaEnhance on this benchmark and found that the proposed methods achieved near-perfect F1-scores in detecting errors, and F1-scores ranging from 0.85 to 1.00 in correcting errors for five of the seven fields.
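The detect-then-canonicalize pattern described above can be sketched for a single field. The degree-name surface forms and canonical mappings below are hypothetical examples; MetaEnhance itself uses AI methods rather than a fixed lookup table:

```python
import re

# Hypothetical canonical forms for an ETD "degree" field; the real
# MetaEnhance framework learns corrections rather than hard-coding them.
CANONICAL_DEGREES = {
    "phd": "Ph.D.",
    "doctor of philosophy": "Ph.D.",
    "ms": "M.S.",
    "master of science": "M.S.",
}

def _normalize(value: str) -> str:
    """Lowercase and strip punctuation so surface variants collide."""
    return re.sub(r"[^a-z ]", "", value.lower()).strip()

def detect_error(value: str) -> bool:
    """Flag values that match no known surface form of the field."""
    return _normalize(value) not in CANONICAL_DEGREES

def canonicalize(value: str) -> str:
    """Map a noisy field value to its canonical form, if one is known."""
    return CANONICAL_DEGREES.get(_normalize(value), value)
```

For example, `canonicalize("Doctor of Philosophy")` and `canonicalize("PhD")` both collapse to `"Ph.D."`, while an out-of-vocabulary value is flagged by `detect_error` and passed through unchanged.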
Toward a Flexible Metadata Pipeline for Fish Specimen Images
Jebbia, Dom, Wang, Xiaojun, Bakis, Yasin, Bart, Henry L. Jr., Greenberg, Jane
Flexible metadata pipelines are crucial for supporting the FAIR data principles. Despite this need, researchers seldom report their approaches for identifying metadata standards and protocols that support optimal flexibility. This paper reports on an initiative targeting the development of a flexible metadata pipeline for a collection containing over 300,000 digital fish specimen images, harvested from multiple data repositories and fish collections. The images and their associated metadata are being used for AI-related scientific research involving automated species identification, segmentation, and trait extraction. The paper provides contextual background, followed by the presentation of a four-phased approach involving: 1. Assessment of the Problem, 2. Investigation of Solutions, 3. Implementation, and 4. Refinement. The work is part of the NSF Harnessing the Data Revolution, Biology Guided Neural Networks (NSF/HDR-BGNN) project and the HDR Imageomics Institute. An RDF graph prototype pipeline is presented, followed by a discussion of research implications and a conclusion summarizing the results.
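The RDF graph approach mentioned above can be sketched as a tiny triple store linking an image to its specimen and source repository. Every URI, predicate name, and specimen identifier here is an invented placeholder, not the project's actual schema:

```python
EX = "http://example.org/fish/"  # hypothetical namespace, not the project's

# Illustrative (subject, predicate, object) triples for one specimen image.
triples = {
    (EX + "image/0001", EX + "depicts", EX + "specimen/INHS-12345"),
    (EX + "specimen/INHS-12345", EX + "scientificName", "Notropis atherinoides"),
    (EX + "image/0001", EX + "harvestedFrom", EX + "repo/example"),
}

def objects(subject: str, predicate: str) -> list:
    """Look up all objects asserted for a (subject, predicate) pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def describe(subject: str) -> dict:
    """Collect every predicate/object asserted about one subject."""
    return {p: o for s, p, o in triples if s == subject}
```

The flexibility the paper argues for shows up in this shape: adding a new repository or trait-extraction result means asserting new triples, with no change to any fixed record layout.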
The Variable Quality of Metadata About Biological Samples Used in Biomedical Experiments
Gonçalves, Rafael S., Musen, Mark A.
We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample, a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples, a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.
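The kind of requirement checking this study performed can be sketched as per-field validation rules. The field names and rules below are hypothetical simplifications; the actual BioSample/BioSamples requirements are far richer:

```python
# Hypothetical per-field requirements for a sample metadata record:
# a non-empty organism name, a numeric age, and a binary tumor flag.
RULES = {
    "organism": lambda v: isinstance(v, str) and v.strip() != "",
    "age": lambda v: str(v).replace(".", "", 1).isdigit(),
    "is_tumor": lambda v: str(v).lower() in {"yes", "no", "true", "false"},
}

def invalid_fields(record: dict) -> list:
    """Return the names of fields whose values violate their stated requirement."""
    return [f for f, ok in RULES.items() if f in record and not ok(record[f])]
```

A record with `age="not applicable"` or `is_tumor="N/A"` fails these checks, mirroring the paper's finding that even simple binary or numeric fields are often populated with values of the wrong type.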
CiteSeerX: AI in a Digital Library Search Engine
Wu, Jian (Pennsylvania State University) | Williams, Kyle Mark (Pennsylvania State University) | Chen, Hung-Hsuan (Industrial Technology Research Institute) | Khabsa, Madian (Pennsylvania State University) | Caragea, Cornelia (University of North Texas) | Tuarob, Suppawong (Pennsylvania State University) | Ororbia, Alexander G. (Pennsylvania State University) | Jordan, Douglas (Pennsylvania State University) | Mitra, Prasenjit (Pennsylvania State University) | Giles, C. Lee (Pennsylvania State University)
Since then, the project has been directed by C. Lee Giles. This is different from arXiv, Harvard ADS, and PubMed, where papers are submitted by authors or pushed by publishers. Unlike Google Scholar and Microsoft Academic Search, where a significant portion of documents have only metadata (such as titles, authors, and abstracts) available, CiteSeerX users have access to full text. CiteSeerX keeps its own repository, which serves cached versions of papers even if their previous links are not alive any more. In addition to paper downloads, CiteSeerX provides automatically extracted metadata and citation context; this document metadata download service is not available from Google Scholar and only recently became available from Microsoft Academic Search. The system was migrated from a physical machine cluster to a private cloud using virtualization techniques (Wu et al. 2014). CiteSeerX extensively leverages open source software, which significantly reduces development effort: Red Hat Enterprise Linux (RHEL) 5 and 6 are the operating systems for all servers; Tomcat 7 is used for web service deployment on web and indexing servers; MySQL is used as the database management system to store metadata; Apache Solr is used for the index; and the Spring framework is used in the web application. In this section, we highlight four AI solutions that are leveraged by CiteSeerX and that tackle different challenges in the metadata extraction and ingestion modules (tagged by C, E, D, and A in figure 1). While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and search engines.
CiteSeerX: AI in a Digital Library Search Engine
Wu, Jian (The Pennsylvania State University) | Williams, Kyle (The Pennsylvania State University) | Chen, Hung-Hsuan (The Pennsylvania State University) | Khabsa, Madian (The Pennsylvania State University) | Caragea, Cornelia (University of North Texas) | Ororbia, Alexander (The Pennsylvania State University) | Jordan, Douglas (The Pennsylvania State University) | Giles, C. Lee (The Pennsylvania State University)
CiteSeerX is a digital library search engine that provides access to more than 4 million academic documents with nearly a million users and millions of hits per day. Artificial intelligence (AI) technologies are used in many components of CiteSeerX, e.g. to accurately extract metadata, intelligently crawl the web, and ingest documents. We present key AI technologies used in the following components: document classification and deduplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation. These AI technologies have been developed by CiteSeerX group members over the past 5–6 years. We also show the usage status, payoff, development challenges, main design concepts, and deployment and maintenance requirements. While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and/or search engines.